Prodding the ROC Curve: Constrained Optimization of Classifier Performance
Authors
Abstract
When designing a two-alternative classifier, one ordinarily aims to maximize the classifier’s ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance. Although the improvement is modest in magnitude, it is nonetheless impressive given the difficulty of the problem and the financial return that it achieves for the service provider.

When designing a classifier, one must specify an objective measure by which the classifier’s performance is to be evaluated. One simple objective measure is to minimize the number of misclassifications. If the cost of a classification error depends on the target and/or response class, one might utilize a risk-minimization framework to reduce the expected loss. A more general approach is to maximize the classifier’s ability to discriminate one class from another (e.g., Chang & Lippmann, 1994).

An ROC curve (Green & Swets, 1966) can be used to visualize the discriminative performance of a two-alternative classifier that outputs class posteriors. To explain the ROC curve, a classifier can be thought of as making a positive/negative judgement as to whether an input is a member of some class. Two different accuracy measures can be obtained from the classifier: the accuracy of correctly identifying an input as a member of the class (a correct acceptance or CA), and the accuracy of correctly identifying an input as a nonmember of the class (a correct rejection or CR). To evaluate the CA and CR rates, it is necessary to pick a threshold above which the classifier’s probability estimate is interpreted as an “accept,” and below which it is interpreted as a “reject”; call this the criterion. The ROC curve plots CA against CR rates for various criteria (Figure 1a). Note that as the criterion is lowered, the CA rate increases and the CR rate decreases. For a criterion of 1, the CA rate approaches 0 and the CR rate 1; for a criterion of 0, the CA rate approaches 1 and the CR rate 0. Thus, the ROC curve is anchored at (0,1) and (1,0), and is monotonically nonincreasing. The degree to which the curve is bowed reflects the discriminative ability of the classifier; the dashed curve in Figure 1a therefore corresponds to a better classifier than the solid curve.

The degree to which the curve is bowed can be quantified by various measures such as the area under the ROC curve or d’, the distance between the positive and negative distributions. However, training a classifier to maximize either the ROC area or d’ often yields the same result as training a classifier to estimate posterior class probabilities, or equivalently, to minimize the mean squared error (e.g., Frederick & Floyd, 1998). The ROC area and d’ scores are useful, however, because they reflect a classifier’s intrinsic ability to discriminate between two classes, regardless of how the decision criterion is set.
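To make the CA/CR and criterion definitions above concrete, here is a minimal sketch (not taken from the paper; the synthetic scores and labels and the helper name `roc_points` are our own illustrative assumptions) that sweeps the criterion over a classifier's posterior estimates and reports the resulting CA/CR pairs:

```python
# Minimal sketch: sweeping the decision criterion over a classifier's posterior
# estimates to trace the CA/CR trade-off described above. All data are synthetic.
import numpy as np

def roc_points(scores, labels, thresholds):
    """Return (CA rate, CR rate) pairs for each criterion in `thresholds`.

    scores -- estimated P(class member) for each input
    labels -- 1 for class members, 0 for nonmembers
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in thresholds:
        accept = scores >= t                       # "accept" when the estimate clears the criterion
        ca = np.mean(accept[labels == 1])          # correct acceptances among members
        cr = np.mean(~accept[labels == 0])         # correct rejections among nonmembers
        points.append((ca, cr))
    return points

# Example: synthetic members and nonmembers with overlapping score distributions.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.6, 0.15, 1000), rng.normal(0.4, 0.15, 1000)]).clip(0, 1)
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
for ca, cr in roc_points(scores, labels, thresholds=[0.3, 0.5, 0.7]):
    print(f"CA = {ca:.2f}, CR = {cr:.2f}")
```

Tracing such pairs over a fine grid of criteria yields the ROC curve: lowering the criterion moves along the curve toward higher CA and lower CR, without changing the classifier itself.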
Each point on an ROC curve thus indicates one possible CA/CR trade-off the classifier can achieve, and that trade-off is determined by the criterion. Changing the criterion, however, does not change the classifier’s intrinsic ability to discriminate.

Generally, one seeks to optimize the discrimination performance of a classifier. However, we are working in a domain where overall discrimination performance is not as critical as performance at a particular point on the ROC curve, and we are not interested in the remainder of the ROC curve. To gain an intuition as to why this goal should be feasible, consider Figure 1b. Both the solid and dashed curves are valid ROC curves, because they satisfy the monotonicity constraint: as the criterion is lowered, the CA rate does not decrease and the CR rate does not increase. Although the bow shape of the solid curve is typical, it is not mandatory; the precise shape of the curve depends on the nature of the classifier and the nature of the domain. Thus, it is conceivable that a classifier could produce a curve like the dashed one. The dashed curve indicates better performance when the CA rate is around 50%, but worse performance when the CA rate is much lower or higher than 50%. Consequently, if our goal is to maximize the CR rate subject to the constraint that the CA rate is around 50%, or to maximize the CA rate subject to the constraint that the CR rate is around 90%, the dashed curve is superior to the solid curve. One can imagine that better performance can be obtained along some stretches of the curve by sacrificing performance along other stretches. Note that obtaining a result such as the dashed curve requires a nonstandard training algorithm, as the discrimination performance as measured by the ROC area is worse for the dashed curve than for the solid curve.

In this paper, we propose and evaluate four algorithms for optimizing performance in a certain region of the ROC curve. To begin, we explain the domain we are concerned with and why focusing on a certain region of the ROC curve is important in this domain.

[Figure 1: (a) ROC curves plotting CA rate against CR rate (axes 0 to 100%) for two classifiers; (b) two valid ROC curves that trade performance in one region of the curve against another.]
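To illustrate what "performance at a particular point on the ROC curve" means operationally, the following sketch (our own construction under assumed names such as `cr_at_fixed_ca`; it is not one of the four algorithms proposed in the paper) sets the criterion so that the CA rate is approximately a target value, here 50%, and reads off the CR rate achieved there:

```python
# Sketch: evaluating a classifier only at the constrained operating point -- the CR rate
# achieved when the criterion is set so that the CA rate is roughly a fixed target.
import numpy as np

def cr_at_fixed_ca(scores, labels, target_ca=0.5):
    """Choose the criterion whose CA rate is closest to `target_ca`; return (criterion, CA, CR)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    member_scores = scores[labels == 1]
    # The (1 - target_ca) quantile of member scores accepts roughly target_ca of the members.
    criterion = np.quantile(member_scores, 1.0 - target_ca)
    ca = np.mean(member_scores >= criterion)
    cr = np.mean(scores[labels == 0] < criterion)
    return criterion, ca, cr

# Synthetic usage: two hypothetical classifiers would then be compared solely by the CR
# rate each achieves at CA ~= 50%, regardless of which has the larger overall ROC area.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.6, 0.15, 1000), rng.normal(0.4, 0.15, 1000)]).clip(0, 1)
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(cr_at_fixed_ca(scores, labels, target_ca=0.5))
```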
Similar Resources
Upper and Lower Bounds of Area Under ROC Curves and Index of Discriminability of Classifier Performance
Area under an ROC curve plays an important role in estimating discrimination performance: a well-known theorem by Green (1964) states that ROC area equals the percentage correct in a two-alternative forced-choice setting. When only a single data point is available, upper and lower bounds on discrimination performance can be constructed based on the maximum and minimum area of legitimate ROC c...
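The equivalence referred to above can be checked numerically. A small sketch with synthetic scores (our assumption, not the cited paper's data) estimates the ROC area as the fraction of member/nonmember pairs ranked correctly, i.e., the percentage correct in a two-alternative forced-choice task:

```python
# Sketch: ROC area estimated as the 2AFC percentage correct, using synthetic scores.
import numpy as np

rng = np.random.default_rng(2)
pos = rng.normal(1.0, 1.0, 2000)   # scores for class members
neg = rng.normal(0.0, 1.0, 2000)   # scores for nonmembers

# Fraction of member/nonmember pairs ranked correctly, counting ties as half correct.
diff = pos[:, None] - neg[None, :]
auc_2afc = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)
print(f"2AFC percentage correct (= ROC area): {auc_2afc:.3f}")
```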
Designing Classifier Ensembles with Constrained Performance Requirements
Classification requirements for real-world classification problems are often constrained by a given true positive or false positive rate to ensure that the classification error for the most important class is within a desired limit. For a sufficiently high true positive rate, this may result in the set-point being located somewhere in the flat portion of the ROC curve where the associated false...
Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic
When the goal is to achieve the best correct classification rate, cross entropy and mean squared error are typical cost functions used to optimize classifier performance. However, for many real-world classification problems, the ROC curve is a more meaningful performance measure. We demonstrate that minimizing cross entropy or mean squared error does not necessarily maximize the area under the ...
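As a rough sketch of the general idea behind such approaches (not necessarily the cited paper's exact formulation; the smoothing parameter `beta` and the function name `soft_wmw` are assumptions), the Wilcoxon-Mann-Whitney statistic counts correctly ranked positive/negative pairs, and a smooth surrogate of that count can be maximized by gradient methods in place of cross entropy or mean squared error:

```python
# Sketch: a smooth surrogate for the Wilcoxon-Mann-Whitney statistic (and hence the ROC area).
import numpy as np

def soft_wmw(pos_scores, neg_scores, beta=10.0):
    """Approximate the WMW statistic by replacing the indicator [s_pos > s_neg]
    with a sigmoid of the score difference; `beta` controls the sharpness."""
    diff = np.asarray(pos_scores)[:, None] - np.asarray(neg_scores)[None, :]
    return np.mean(1.0 / (1.0 + np.exp(-beta * diff)))

# In practice the scores would come from a differentiable model, and this quantity would be
# maximized with respect to the model parameters inside an autodiff framework, pushing
# member/nonmember pairs toward the correct ranking rather than targeting per-example
# probabilities as cross entropy or MSE do.
```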
Reliable Classifiers in ROC Space
The performance of a classifier can be improved by abstaining on uncertain instance classifications. The transformation from the original Receiver Operating Characteristic (ROC) curve to the curve obtained by abstention is provided. We include proofs of dominance of this new ROC curve to aid classifier selection and to show the effectiveness of the approach. For specific cost and class distribut...
Tuning the hyperparameter of an AUC-optimized classifier
The Area under the ROC curve (AUC) is a good alternative to the standard empirical risk (classification error) as a performance criterion for classifiers. While most classifier formulations aim at minimizing the classification error, few methods exist that directly optimize the AUC. Moreover, the reported methods that optimize the AUC are often not efficient even for moderately sized datasets. ...